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Abstract 

Classical machine learning algorithms often face scalability bottlenecks when they are applied to 
large-scale data. Such algorithms were designed to work with small data that is assumed to fit in the 
memory of one machine. In this report, we analyze different methods for computing an important 
machine leaning algorithm, namely Principal Component Analysis (PCA), and we comment on its 
limitations in supporting large datasets. The methods are analyzed and compared across two important 
metrics: time complexity and communication complexity. We consider the worst-case scenarios for both 
metrics, and we identify the software libraries that implement each method. The analysis in this report 
helps researchers and engineers in (i) understanding the main bottlenecks for scalability in different 
PCA algorithms, (ii) choosing the most appropriate method and software library for a given application 
and data set characteristics, and (iii) designing new scalable PCA algorithms. 


I. Introduction 

Enormous amounts of data are being generated every day from soeial networks, sensors and 
web sites. This data often bas signifieant value for business, seienee, government, and soeiety. 
Owners of this data strive to extraet useful information out of it, often by applying maebine 
learning algorithms. Distributed maebine learning algorithms are needed beeause of the large data 
volumes and the need to obtain fast results. Distributed maebine learning algorithms, however, 
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introduce a new set of ehallenges. For example, during the distributed exeeution of a machine 
learning algorithm, proeessing nodes may need to exehange data among eaeh other, whieh we 
eall intermediate data. If not earefully managed, this intermediate data may aetually beeome the 
main bottleneek for sealing machine learning algorithms, regardless of the available number of 
eomputing nodes. 

In this report, we foeus on analyzing the methods for eomputing prineipal eomponent analysis 
(PCA). We analyze all methods aeross two important metries: time eomplexity and eommuni- 
eation eomplexity. We eonsider the worst-ease seenarios for both metries. The time eomplexity 
is the upper bound on the number of eomputational steps needed by the algorithm to terminate. 
The eommunieation eomplexity is the worst-ease total size of the intermediate data. In addition, 
during our analysis, we identify the methods implemented in eommon libraries sueh as Mahout, 
MLlib, and SeaLAPACK. Mahout [[U is a eolleetion of maehine learning algorithms implemented 
on Hadoop MapReduee. MLlib O is a Spark implementation of some eommon maehine learning 
algorithms. SeaLAPACK [|3l| is a library of linear algebra algorithms implemented for parallel 
distributed memory maehines. 

The analysis in this report helps researehers and engineers in (i) understanding the main 
bottleneeks for sealability in different PCA algorithms, (ii) ehoosing the most appropriate method 
and software library for a given applieation and data set eharaeteristies, and (iii) designing new 
sealable PCA algorithms: e.g., the work in 0. To the best of our knowledge, sueh a rigorous 
analysis was never done before, and it is erueial for seleeting the proper PCA method for different 
environments and datasets. 

The notation we use in this report is mostly eonsistent with Matlab’s programing language. 
Variable names, ineluding matriees, are eomposed of one or more letters. Multiplieation is 
indieated with a star (*) between the variables. M' and are the transpose and inverse 
of matrix M, respeetively. 1 is the identity matrix. Furthermore, Mi denotes row i of matrix M. 
We use Mj to refer to the yth element of veetor M,. 

The rest of this report is organized as follows. In Seetion |nl we present the details of our 
distributed exeeution eost model that we use to analyze different PCA methods. In Seetion Uni 
we analyze the first method for eomputing PCA whieh is based on eigenvalue deeomposition of 
the eovarianee matrix. In seetion |IVl we analyze the methods for eomputing PCA that are based 
on Singular Value Deeomposition (SVD). In Seetion |Vl we analyze a PCA method based on a 
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variant of SVD called Stochastic SVD. In Section |Vll we analyze a PC A method based on the 
Probalistic Principal Component Analysis algorithm (PPCA). Section IVIII presents a summary 
of the analysis, and Section IVTTTI concludes the report. 

II. Distributed Execution Cost Model 

We analyze all methods across two important metrics: time complexity and communication 
complexity. We consider the worst-case scenarios for both metrics. The time complexity is the 
upper bound on the number of computational steps needed by the algorithm to terminate. Some 
PCA algorithms run multiple iterations of the same code, where each iteration improves the 
accuracy of its predecessor by starting from a better initial state. The time complexity that we 
present is for a single iteration, as the number of iterations is typically bounded by a small 
constant. 

During the distributed execution of a PCA algorithm, processing nodes may need to exchange 
data among each other, which we call intermediate data. The worst-case total size of the 
intermediate data is considered as the communication complexity. We note that most PCA 
algorithms work in multiple synchronous phases, and the intermediate data is exchanged at 
the end of each phase. That is, a phase must wait for the entire intermediate data produced 
by its predecessor phase to be received before its execution starts. Therefore, a large amount 
of intermediate data will introduce delays and increase the total execution time of the PCA 
algorithm, and hence the intermediate data can become a major bottleneck. The exact delay will 
depend on the cluster hardware (network topology, link speed, EO speed, etc.) as well as the 
software platform used to manage the cluster and run the PCA code. Some software platforms, 
e.g., Hadoop/MapReduce, exchange intermediate data through the distributed storage system, 
while others, e.g.. Spark, exchange data through shared virtual memory. For our analysis of 
communication complexity to be general, we consider the total number of bytes that need to be 
exchanged, and we abstract away the details of the underlying hardware/software architecture. 

III. Eigenvalue Decomposition of the Covariance matrix 

A simple way of computing PCA of a matrix A is to compute the eigenvalue decompositon of 
its covariance matrix. Given N x D matrix A, a target dimensionality d, the algorithm computes 
an N xd matrix V such that the columns of V are the principal components of A. The algorithm 
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for computing the eigenvalue deeomposition of the eovarianee matrix ean be summarized in the 
following steps: 

• Stepl: Compute the mean-centered matrix Ac 

The mean-centered matrix Ac is eomputed by subtracting the veetor of all the column means 
of matrix A from eaeh row of A. 

• Step 2: Compute the covariance matrix Cv =a[*Ac 

The eovarianee matrix is computed by multiplying the mean-eentered matrix Ac with its 
transpose. This is a eomputationally intensive step beeause it requires multiplying two 
matriees where typieally none of them ean fit in memory. 

• Step 3: Compute the eigenvalue decomposition of Cv 
Eigenvalues A/ and eigenveetors v/ satisfy the relation: 

Cv * A/ = A/ * Vi, 

where A, is the eigenvalue and vt is its eorresponding eigenveetor. Putting the above 
formula in the matrix form: 

Cv * A = A * y, 

where A is a diagonal matrix whose elements are the eigenvalues of the eovarianee matrix 
Cv, and V is the matrix whose eolumns are the eorresponding eigenvectors.The eigenvalue 
deeomposition of the matrix Cv is given by: 

Cv = A*y * A^^ 

Several algorithms are used to perform sueh faetorization [|T2|; they inelude the QZ algo¬ 
rithm [fT4]l . and Cholesky factorization lfT3ll when the matrix is symmetrie. 

• Step 4: Get the principal components 

The prineipal eomponents of the input matrix A are the eigenveetors assoeiated with the 
largest eigenvalues. Since the eigenvalues in the diagonal matrix A are sorted in a deereasing 
order, then the first d veetors in the matrix V are the prineipal eomponents of A: V = 

(vi,V2,...,Vrf). 

Time Complexity: The algorithm has two eomputationally intensive steps: eomputing the 
covariance matrix, and eomputing the eigenvalue decomposition of the eovarianee matrix. The 
eomputational eomplexity of the eovarianee matrix eomputations is 0{ND x min{N,D)) whieh 
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is a result of multiplying two matrices of size DxN and N x D, respectively. The other com¬ 
putationally intensive computation is the eigenvalue decomposition. The work in |[T^ analyzes 
the computational complexity of several algorithms for eigenvalue decomposition. The worst 
case complexity of such algorithms is 0{D^) for a matrix of size DxD. Therefore the overall 
complexity is 0{ND x min{N,D) +D^). 

Communication Complexity: In addition to the computational cost, the covariance matrix is 
large and dense {D x D) which incurs substantial communication cost 0{D^), making the method 
not suitable for large datasets. 

Additional Notes: The algorithm described in this section is implemented in RscaLAPACK. 
RscaLAPACK is an add-on package for the R programming language. R is an open-source prgo- 
ramming language used extensively in statistics and data mining tasks. RscaLAPACK provides 
a scalable PCA implementation that can be called from an R program and uses the parallel 
linear algebra routines implemented in the ScaLAPACK library. It is worth mentioning that 
RscaLAPACK includes another scalable PCA implementation that uses singular value decocm- 
position (SVD). RscaLAPACK documentation mentions that the latter is the prefered method 
for computing PCA. In the next sections we describe the SVD-based algorithms including the 
one implemented in RscaLAPACK. 

IV. Singular Value Decomposition (SVD) 

Another way of computing PCA is done by a singular value decomposition (SVD) of a mean- 
centered matrix. SVD decomposes matrix A into three matrices U, E and V. If A is mean-centered 
(i.e., subtracted by the column mean of A), V gives the principal components of matrix A. 

Several methods have been proposed to compute the singular value decomposition of a 
matrix. Section IIV-AI describes early methods for computing SVD for dense matrices. Such 
methods have evolved over time and they are currently being used by some libraries and 
frameworks. Section IIV-BI describes variants of SVD algorithms that are optimized for sparse 
matrices. Such algorithms offer substantial speedup and efficient memory usage when matrices 
are sparse. Section |V] explains an approximate singular value decompostion algorithm, referred 
to as Stochastic SVD (SSVD). The algorithm uses random sampling techniques to compute SVD 
and provides faster and more scalable means of SVD computation. The algorithm is implemented 
in the popular Mahout/MapReduce machine learning library. 
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Fig. 1. Flowchart of the singular value decomposition (SVD) algorithm in LAPACK/ScaLAPACK. 


A. Dense Singular Value Decomposition 

Golub and Kahan (1965) introduced a two-step approaeh for eomputing SVD for a given 
N xD matrix A. In the first step, A is redueed to a bidiagonal matrix B. The seeond step finds 
SVD of the bidiagonal matrix B. Demmel and Kahan [[6l extended it to a three-step proeedure. 
They added a QR deeomposition step before the bidiagonalization. Thus, bidiagonalization is 
performed for the triangular matrix R instead of matrix A. The algorithm deseribed in this seetion 
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is implemented in RseaLAPACK. Figure [U shows the flowehart of it. The figure shows the names 
of the LAPACK routines used to perform eaeh step. In the rest of the seetion, we analyze the 
extended algorithm deseribed whieh ean be summarized as: 

• Step 1: Compute QR decomposition of A 

When N ^ D, it is more effieient to first perform a QR deeomposition of A. The deeompo- 
sition reduees the matrix A into an orthonormal matrix Q and a triangular matrix R. SVD 
is then performed more efficiently on the triangular matrix R instead of A. Since A = Q*R, 
and R = U *'L*V (described in Step 2), then the SVD of matrix A is given by: 

A = Q * U 

• Step 2: Reduce R to bidiagonal matrix B 

As mentioned in the previous step, SVD is performed on the triangular matrix R. This is 
done in two steps; 1) Computing bidagonal matrix B and 2) computing the SVD of B. 
In this step R is reduced to a bidiagonal matrix B using a series of Givens rotations as 
described in iQ. In each iteration, R is multiplied by a rotation matrix G,. Each rotation 
converts one of the elements of R to zero. The series of rotations continues until matrix R 
is converted to a bidiagonal matrix B. The output of this step is: 

R = Ui *B 

where each of Ui and Vi is a series of rotations and B is the bidiagonal matrix computed 
in this step. Ui and Vi are defined by: 

t'i=nG.. ''i=nGj. 

i j 

«U2 = 1^2,3,...,r. ij^j, 

where r is the is the total number of rotations required to convert matrix R to the bidiagonal 
matrix B. 

• Step 3: Compute singular value decompostion (SVD) of the bidiagonal matrix B 

In this step, the QR iteration algorithm is used to compute the SVD of the bidiagonal matrix 
B. To avoid confusion, it should be noted that the QR iteration algorithm is different from 
QR decomposition that is applied in Step 1. The QR iteration algorithm described in this 


step computes SVD of a matrix, unlike the one described in Step 1 that decomposes a 
matrix into two matrices Q and R. 

Before describing how QR iteration algorithm is applied for bidigonal matrices, we describe 
how the algorithm works for an arbitrary real matrix O that is not necessarily bidiagonal. 
For iteration i, QR decomposition of matrix O is computed as Oj = Q*R, where Q is an 
orthonormal matrix and R is upper triangular matrix with positive diagonal. In the next 
iteration, Oi+i is set to =R*Q. Since matrix Q is orthornormal (i.e. qI *Q = I) then: 

Oi+i =R*Q = Q *Q*R*Q = Q *Oi*Q. 


From the previous equation, it is shown that Oj and are similar matrices. Hence, 
they have the same eigenvalues. Two matrices A and B are similar if B = * A* P for 

some invertible matrix R Moreover, it is proved in |[8l that as i —)■ oo, Oi converges to a 
triangular matrix with the eigenvalues on the diagonal. This algorithm may be applied to the 
bidiagonal singular value problem as follows [I5]|. Let Bq be our initial bidiagonal matrix. 
Given Bj, compute the QR decomposition Bf^B- = Qi*Ri and B\*Bi = Q 2 *R 2 - Then let 
^i+i = 2i * Bi * Q 2 . Observe that = ^1 * 2i and 5-^^ *= R 2 * 22- Hence, 

the usual QR iteration algorithm that was applied on G, is applied to 5/ * B- and B^ * Bi 
simultaneously. In [|5l, it is shown that Bi is bidiagonal for all i and converges as i —)■ 0 ° to 
a diagonal matrix with the singular values on the diagonal. Assuming that 5/ converges in 
the second iteration then the equation can be written as: 

B2 = Qi*B\*Q2 = Qi*Qx*Bq*Q2*Q2- 

Where B = Bq h the original bidiagonal matrix. Finally, the singular value decomposition 
of matrix B can be represented as: 

B = (2i *Qi)~^ *B2* {Q2*■, 


which we write as: 


B = U 2 * L * V2- 


Step 4: Form the singular values and singular vectors of A 

According to the above steps, we can deduce the following formulas: A = Q*R (Step 1), 
A = Q*Ui*B*Vi (Step 2), and A = Q*Ui*U 2 *'L*V 2 *Vi (Step 3). Since the SVD of A 


is given by A = the left singular veetors U and the right singular veetors v' ean 

be eomputed by the following formulas: 

U = Q*Ui*U 2, v'=V2*Vi. 

The eolumns of V' are the prineipal eomponents of the input matrix A if A is mean-eentered 
(i.e., subtraeted by the eolumn mean of A) 

Time Complexity: As shown in Figure [B RseaLAPACK PCA implementation uses routines 
xGEQRF, xGBBRD, xORGBR. Performanee analysis study of sueh routines preseneted in ifT^ 
shows that the floating point operations eount for eaeh of the above routine is either quadratie 
or eubie in the number of dimensions. More speeifieally, [fT^ shows that the floating point 
operations eount performed by the first two routines is 0{ND'^ +D^). 

Communication Complexity: A disadvantage of this algorithm is that it invokes several 
routines and eaeh routine eomputes an intermediate output whieh results in eommunieation 
overhead. For large-seale data, eommunieation eosts ean beeome a bottleneek for sealabilty. The 
algorithm deseribed in this seetion eonsists of three main steps that produee intermediate output: 
QR deeomposition and bidiagonalization and SVD eomputation of the bidiagonal matrix. First, 
QR deeomposition of the N x D matrix A results in two matriees, N x d matrix Q and dx D 
matrix R. Therefore, the order of intermediate data in this step is 0{{N -\- D)d). Seeond, the 
bidigonalization step of the matrix R results in three matriees: dx d matrix Ui, dxD matrix 
B, and DxD matrix Vi. Third, the SVD eomputation of the bidigonal matrix B results in three 
matriees of the same dimensions as the ones eomputed in the bidiagonalization step. Henee, the 
size of intermediate data of eaeh of the previous two steps is 0{d^ -{-Dd + D^) = 0{D^). 

To summarize, the total amount of intermediate data generated by the algorithm aeeross the 
three steps is: 0{max{N + D)d,D^). In this analysis, we assume that all elements of the matriees 
are stored even if some matriees are sparse. This is not praetieal in real-world applieations, 
however, it provides an upper bound on the amount of intermediate data. 

B. Singular Value Decomposition for Sparse Matrices 

SVD implementations optimized for sparse matriees have been explored by some large-seale 
maehine learning libraries. Mahout provides a Lanezos SVD algorithm that takes a sparse matrix 
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as an input. The computational complexity of the algorithm employed by Mahout is 0{Nz^), 
where z is the number of non-zero dimensions (out of D dimensions). We refer to this method 
as SVD-Lanczos, and it is implemented in popular libraries such as Mahout and Graphlab. The 
SVD-Lanczos algorithm, however, is not efficient for performing PC A for large datasets. This is 
because the matrix must be mean-centered in order to obtain the principal components as a result 
of SVD. Since in many practical applications the mean of the matrix is not zero, subtracting the 
mean from a sparse matrix will substantially decrease its sparsity. In this case, z will approach 
the full dimensionality D, and the cost for computing PCA using SVD-Lanczos will be 0{ND^), 
which is prohibitive for large datasets. Graphlab provides the Lanczos SVD implmentation 
described in [fTTl . It is a slightly different variant of the algorithm implemented in Mahout. 
The work in ifTTll argues that the quality of the solution of the algorithm implemented in Mahout 
deteriorates very rapidly when the number of iterations exceeds a certain value (5-10), and the 
solution becomes no longer accurate. Graphlab avoids this problem by supporting a restarted 
variant of the Lanczos algorithm in order to enhance the quality of the output eigenvectors. In 
this variant, the number of iterations of the Lanczos algorithm is set to a fixed value, and when 
this value is reached the algorithm is re-initiated with a better initial vector. The method for 
choosing the enhanced initial vector is described in ifTTI . 

V. Stochastic Singular Value Decomposition (SSVD) 

Randomized sampling techniques have recently gained interest in solving large-scale linear 
algebra problems. Halko et al. ifTOll discussed randomized methods to compute approximate 
factorization of matrices (e.g., SVD). This work proposed a modular framework that computes 
approximate SVD in two stages. Stage I computes a low dimensional matrix that approximates 
the range of the input matrix, which is done via randomized sampling. Stage II uses determenistic 
techniques to compute the SVD of the low-dimensional matrix computed in stage 1. Halko et 
al. [fT^ describe a set of randomized techniques used in stage I for rapidly constructing low¬ 
dimensional approximation matrices. Randomized methods are often faster and enable compu¬ 
tations at large scale compared to the determenistic methods. Furthermore, randomized methods 
have the advantage of providing a balance between speed and accuracy. 

In the rest of this section, we focus on the analysis of the Stochastic Singular Value Decom¬ 
position (SSVD) algorithm implemented in Mahout, since the algorithm is designed to scale for 
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large datasets and this is the focus of our work. 

1) Stage I 

This stage requires an orthonormal matrix Q where A k, Q* *A. Q should contain as 
few columns as possible and more importantly, Q should approximate the range of A as 
much as possible (i.e., A is as close as possible to Q*Q' * A). The algorithm to solve this 
problem can be formally defined as follows: Given N x D matrix A, a target dimensionality 
d, and an oversampling parameter p. The algorithm computes an N x (d + p) matrix Q that 
has orthonormal columns and approximates the range of A. A matrix Q approximates the 
range of A if any vector of Q is represented as a linear combination of the vectors of matrix 
A. The number of extra samples p is small and adds a great deal of accuracy for a small 
computational cost. As mentioned in [fTOll . p is typically set to 5 or 10, however it can be 
set to p = d without changing the asymptotic complexity of the algorithm. In the rest of 
the discussion, we denote the term (d + p) by the symbol /. The steps for computing the 
matrix Q is described as follows: 

• Step 1: Draw a Gaussian matrix Q. 

Suppose that we are seeking a matrix with d dmensions to approximate the range of 
Matrix A. The idea is to sample d random vectors = 1,2,.. .J. For each random 
vector we compute =A*co^'\ Owing to randomness, the set of vectors are likely 
to be linear. Hence, each vector is a linear combination of the vectors of A. The 
matrix comprising all the vectors z*^'^ can be regarded as an approximate for the range 
of A. The set of vectors (O^'^ are sampled from a standard Gaussian distribution and they 
form the matrix Q. 

• Step 2: Compute the matrix Z = AO 

As mentioned in the previous step, the product A * O results in an A x / matrix Z that 
approximates the range of A. Hence, Z fulfils one of the requirements of the output 
matrix Q. However, the columns of Z are not necessarily orthonormal which is the other 
requirement on Q. Step 4 computes orthornormal vectors using matrix Z. 

• Step 3: Perform power iterations 

In the cases when the input matrix A has slowly decaying singular values (i.e. the variance 
varies slightly accross its dimensions), matrix Z does not accurately approximate matrix 
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A. The power iteration teehnique is empoloyed to improve aeeuracy in sueh situation. The 
motivation behind this teehnique is that singular veetors assoeiated with small singular 
values interfere with the eomputation. So the idea is to multiply the matrix A by A and 
alternatly in each iteration. This results in a matrix with higher power that has the same 
singular vectors as the matrix A but the weight of singular vectors associated with small 
singular values decreases in the new matrix. The resultant matrix denoted by (A ^AY *A 
has rapidly decaying singular values. In summary, this step iteratively updates the value 
of matrix Z according to the below formula: 

Z={A*A)*A*Q, t = l,2,...j, 

where j is the number of power iterations which is usually a value between 2 and 5. 

• Step 4: Compute QR decomposition of matrix Z 
In order to get orthonormal columns from matrix Z, the matrix is decomposed into two 
matrices Q and R, such that Q is an orthornoraml matrix and R is an upper triangular 
matrix. Matrix Q computed in this step is a low-dimensional orthomormal matrix and 
accurately approximates the range of input matrix A which are the requirements on the 
output matrix from Stage I of this algorithm. 

QR decompositon of matrix Z is performed by a series Givens rotations described in dH . 
In each iteration, Z is multiplied by a rotation matrix G/. Each rotation converts one of 
the elements of Z to zero. The series of rotations continues until matrix Z is converted 
to a triangular matrix R, the rotation matrices Gi,G 2 ,...Gj are multiplied to form the 
orthogonal matrix Q. Assuming that Z is 3 x 3 matrix. It needs three Givens rotatations 
to be converted to an upper triangular matrix R. The three rotations Gi, G 2 and G 3 zeroes 
the elements Z 3 i,Z 2 i,andz 32 respectively. The product of the rotation matrices forms the 
matrix q' (i.e., inverse of matrix Q): 

Q = Gi *G2*G3, 

and the matrix R is defined as: 

R = Gi * G 2 * G 3 *Z = Q *Z. 

Therefore, the QR decomposition of Z is defined as: Z = Q*R. 
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2) Stage II 

Given the approximate matrix Q eomputed in stage I, Q is used to eompute the SVD of 
the original matrix A. The following steps are performed to eompute the SVD: 

• Step 1: Compute matrices B = q' * A and b' *B 

In this step, the matrix B is formed by multiplying the transpose of the orthonormal matrix 
Q with the original matrix A. Additionally, a relatively small matrix B *B is eomputed 
in this step. Sueh matrix has I x I dimensions so it can be more efficiently stored and 
used in the following steps. 

• Step 2: Eigenvalue decomposition of 5 * 5 

Eigenvalue decomposition is used to factorize matrix B *B to form the term B *B = 
U where the columns of U are the eigenvectors of B and E is a diagonal matrix 

whose entries are the eigenvalues of B. 

• Step 3: Compute singular vectors of Matrix A 

It can be deduced from step 1 that A = Q*B. Multiplying both sides from the right 
by b' yields the following formula: A*b' = Q*B*b'. By applying the decomposition 
performed in step 2: 


A* B = Q*U *11*U . 

Multiplying both sides by (5 

A = Q*U*11*U *{B!y^. 

Since the SVD of A is defined as A = U *11* v' . Therefore, the left singular vectors U 
and the right singular vectors v' can be defined as: 

U = Q*U, v' = U *{By\ 

Time Complexity: Stage I generally dominates the cost of Stage II in SSVD. Within Stage 
I, the computational bottleneck is usually the matrix-matrix product in Step 2. When the matrix 
Q is standard Gaussian, the cost of this multiplication is 0{NDd). However, [fTOll shows that 
ssvd perform such multiplication in 0{ND x log{d)) only. The key idea is to use a structured 
random matrix that allows to compute the product in 0{ND x log{d)). As described by [fTOl . the 
simplest example of structured matrices is the subsampled random Fourier transform (SRFT), 
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An SRFT is an D x J matrix that has a specific form and properties described in IfTOll . When 
the random matrix is SRFT, The sample matrix Z = A*Q can be computed in 0{ND x log{d)) 
using subsampled FFT |[T0l . 

Communication Complexity: The algorithm requires storing several intermediate matrices. 
Such matrices include N x d matrix Z, N xd matrix Q, d x d matrix R, dxd matrix B, dxd 
matrix U. The total intermediate data can be calculated as 0{max{Nd,d^)). 

VI. Probalistic Principal component analysis (PPCA) 

Probabilistic PCA (PPCA) [[T9l is a probabilistic approach to computing principal components 
of a dataset. In PPCA, PCA is represented as a latent variable model that seeks a linear relation 
between a D-dimensional observed data vector y and a J-dimensional latent variable x. The 
model can be described by the equation: 

y = C*x + jU + £, 

where Cis a Dxd transformation matrix (i.e, the columns of C are the principal components), jj, 
is the vector mean of y, and £ is white noise to compensate for errors. Moreover, [[T^ proposed 
an Expectation Maximization (EM) algorithm for estimating the principal components iteratively 
for this latent variable model. PPCA has the following advantages: 

• Mathametically proven model for PCA 

• Linear complexity in the number of data points, data dimensions and the required number 
of principal components ifT^ 

• Big data is more susceptible to having missing values. Since PPCA uses EM, PCA projec¬ 
tions can be obtained when some data values are missing. 

• Multiple PCA models can be combined as a probabilistic mixture which allows PCA to 
express complex models 

Given a set of n-dimensional observed data vectors Y, a set of J-dimensional hidden vectors 
X, a set of unknown parameters C and a liklihood function p{Y\X^C). The intuition behind 
the algorithm is described in ifTTll as follows. Eirst, make a random guess for the principal 
components C. Second, fix the guessed components and project the data vectors Y into it in 
order to give the hidden states X. Third, fix the values of the hidden states X and choose C that 
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maximizes p{Y\X,C). The last two steps are then repeated until convergence. The EM algorithm 
finds C that maximizes the liklihood function p(Y\X^C). The main steps are as follows: 

1) E-Step: Given current parameters calculate the expected value of the log liklihood 
function E[log{p{Y\X 

2) M-Step: Find the parameters that maximizes the expected value 

c('+i) =£[/og(p(y|x,cW)]. 

3) Repeat steps 1 and 2 until convergence. 

A standard method for testing whether the algorithm converged or not is by computing the 
1-Norm of the reconstruction error, which is given by: e = ||T — A *C~^||i. 

Time Complexity: The analysis presented in IfT^ shows that EM-based PCA has linear 
complexity in the number of data points, dimensions and required principal components; Similar 
studies in m and lIH show that the EM algorithm used to solve PPCA has a computational 
complexity of O(NDd) per iteration. Furthermore, an empirical study conducted by [[T71l showed 
that the number of iterations needed by PPCA to converge is almost constant with changing the 
dimensionality D. Therfore the time complexity of one iteration is sufficient to describe the time 
complexity of the whole algorithm. 

Communication Complexity: The communication complexity of PPCA is bounded by the 
matrix of hidden states X, which is of size N x d. Therefore, the worst case complexity of 
PPCA is 0{Nd). However, the work in [|3 shows that this complexity can be reduced. In this 
work, we present our algorithm, sPCA, which is based on probabilistic PCA. We show that the 
communication complexity of the algorithm is 0{Dd) without changing any theoritical gurantees 
provided by the original probabilistic PCA. sPCA achieves this by redundantly recomputing 
some large matrices instead of storing them as intermediate data. Although these redundant 
computations increase the computational complexity that has been described previously, such 
redundant computations are repeated two times for each iteration. Therefore, the computation 
complexity becomes 0{3NDd) which does not change the assymptotic complexity of the PPCA 
algorithm. 


VII. Summary of the analysis 

The summary of our analysis is shown in Table HI As the table shows, the time complexities 
of the top two methods (eigen decomposition of covariance matrix and SVD of bi-diagonalized 
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Method to Compute PCA 

Time Complexity 

Communication Complexity 

Example Libraries 

Eigen decomp, of covariance matrix 

0{NDxmin{N,D)+D^) 

0(d2) 

MLlib-PCA (Spark), 

RScaLAPACK 

SVD-Bidiag (6) 

0{ND^+D'^) 

0{max{{N + D)d,D-)) 

RScaLAPACK 

Stochastic SVD (SSVD) (TO) 

0{NDd) 

0{max{Nd,d^)) 

Mahout-PCA (MapReduce) 

Probabilistic PCA (PPCA) (20) 

0{NDd) 

0(Dd) 

sPCA 0 


TABLE I 

Comparison of different methods eor computing PCA op an A x D matrix to produce d principal 

COMPONENTS. 


matrix) are approximately cubic in terms of the dimensions of the input matrix, assuming N ^ D. 
Such high time complexities prevent these methods from scaling to large datasets. Even if D <N, 
the time complexity is a function of N (number of data points) multiplied by (number of 
dimensions of each data point), which is still quite high for many datasets with large number of 
dimensions. In addition, the communication complexities of these two methods are also quite 
high, especially for high dimensional datasets. Therefore, even if there are enough computing 
nodes to handle the high computational costs, the communication costs can still hinder the 
scalability of these two methods. 

The last two methods in Table H] (stochastic SVD and probabilistic PCA) have a more efficient 
time complexity of 0{ND), assuming that J is a relatively small constant, which is typically 
the case in many real applications. Thus, these two approaches are potential candidates for 
performing PCA for large datasets. Our analysis and experimental evaluation [|71, however, 
reveal that even though the time complexity of stochastic SVD can be handled by employing 
more computing nodes, it can suffer from high communication complexity. For example, our 
experiments show that the high communications complexity of Mahout-PCA (which uses SSVD) 
prevents it from processing datasets in the order of tens of GBs. Therefore, based on our analysis, 
the most promising PCA approach for large datasets is the probabilistic PCA. 

VIII. Conclusion 

In this report, we analyzed different methods for computing the principal components of a 
given dataset, which is referred to as principal component analysis (PCA). The analysis showed 
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that both the computational complexity as well as the communications complexity are important 
aspects for processing large-scale datasets on distributed platforms. Our analysis indicated that 
the two methods: eigen decomposition of covariance matrix and SVD of bi-diagonalized matrix 
are omputationally intensive as their time complexities are either cubic in terms of the dimensions 
of the input matrix or a function of N (number of data points) multiplied by (number of 
dimensions of each data point), which is quite high for many datasets. On the other hand, the 
analysis showed that Stocahstic SVD (SSVD) and Probablistic PCA are two potential candidates 
for performing PCA on large datasets, since they have the best computational complexity. 
However, SSVD suffers from high communication complexity. Therefore, the most promising 
PCA approach for large datasets is the probabilistic PCA. In our recent work [|71, we have 
designed a distributed and scalable principal component analysis algorithm called sPCA and it 
is based on probabilistic principal component analysis. sPCA substantially outperforms current 
PCA algorithms implemented in Mahout/MapReduce and MLlib/Spark. 
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